Allow postponing dataset integrity checks in `NextGenHDFDataset` #1323

NeoLegends · 2023-05-05T13:00:29Z

Follow-up from #1315, where @JackTemaki noticed the NextGenHDFDataset goes through the entire data on startup to perform integrity checks -- and indeed RETURNN startup w/ that dataset is quite slow. This PR allows delaying these checks to training time, saving startup time at the cost of detecting data errors only later on.

albertz · 2023-05-08T11:32:30Z

Why do you put "chore:" into the description? Where do you get this from?

This is inconsistent to our normal commit messages, so we should not use this.

albertz · 2023-05-08T11:33:49Z

Does this really need to be a new option? I personally prefer to have fewer options if possible, esp if not really needed. Can't we just always do those integrity checks lazily?

albertz · 2023-05-08T11:35:10Z

Please check failing tests.

  returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:548: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:550: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:550: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:551: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:551: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
  returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:603: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:605: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:605: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:606: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:606: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring
  returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter data in docstring
  returnn/datasets/hdf.py:668: WEAK WARNING PyIncorrectDocstringInspection: Missing parameter seq_name in docstring
  returnn/datasets/hdf.py:670: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'numpy'
  returnn/datasets/hdf.py:670: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter numpy in docstring
  returnn/datasets/hdf.py:671: WARNING PyUnresolvedReferencesInspection: Function 'check_data_integrity' does not have a parameter 'str'
  returnn/datasets/hdf.py:671: WEAK WARNING PyIncorrectDocstringInspection: Unexpected parameter str in docstring

Please use PyCharm to directly see those inspections.
In general, be consistent to other code.

albertz · 2023-05-08T11:35:54Z

I guess @JackTemaki and/or @patrick-wilken should otherwise review this.

NeoLegends · 2023-05-08T12:04:54Z

Can't we just always do those integrity checks lazily?

I don't really have an opinion for or against that here -- it's mainly detecting data format errors early vs. saving time. If you're sure your data is good there is no reason to do these checks eagerly.

NeoLegends added the enhancement label May 5, 2023

NeoLegends requested review from patrick-wilken and JackTemaki as code owners May 5, 2023 13:00

NeoLegends self-assigned this May 5, 2023

NeoLegends changed the title ~~chore: Allow postponing dataset integrity checks~~ chore: Allow postponing dataset integrity checks in NextGenHDFDataset May 5, 2023

NeoLegends force-pushed the chore/next-gen-hdf-data-type-assertion branch from 8936027 to a00d0ec Compare May 5, 2023 13:30

NeoLegends changed the title ~~chore: Allow postponing dataset integrity checks in NextGenHDFDataset~~ Allow postponing dataset integrity checks in NextGenHDFDataset May 8, 2023

Allow postponing dataset integrity checks to training time

a8ecbbf

NeoLegends force-pushed the chore/next-gen-hdf-data-type-assertion branch from 2f08d76 to a8ecbbf Compare May 8, 2023 12:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow postponing dataset integrity checks in `NextGenHDFDataset` #1323

Allow postponing dataset integrity checks in `NextGenHDFDataset` #1323

NeoLegends commented May 5, 2023 •

edited

Loading

albertz commented May 8, 2023

albertz commented May 8, 2023

albertz commented May 8, 2023

albertz commented May 8, 2023

NeoLegends commented May 8, 2023

Allow postponing dataset integrity checks in NextGenHDFDataset #1323

Are you sure you want to change the base?

Allow postponing dataset integrity checks in NextGenHDFDataset #1323

Conversation

NeoLegends commented May 5, 2023 • edited Loading

albertz commented May 8, 2023

albertz commented May 8, 2023

albertz commented May 8, 2023

albertz commented May 8, 2023

NeoLegends commented May 8, 2023

Allow postponing dataset integrity checks in `NextGenHDFDataset` #1323

Allow postponing dataset integrity checks in `NextGenHDFDataset` #1323

NeoLegends commented May 5, 2023 •

edited

Loading